Analysis of Flight Data

Author

Nkemdibe Okweye

Published

January 7, 2026

Section 1: System Volume

Code
daily_flights <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem07.csv",
                          show_col_types = FALSE,
                          na = c("", "NA", "NULL")) 

daily_flights <- daily_flights %>%
  mutate(
    FlightDate = as.Date(FlightDate),
    NumFlights = as.numeric(NumFlights),
    AvgFlights_Preceding3Days = as.numeric(AvgFlights_Preceding3Days)
  )

# daily_flights
# str(daily_flights)
# summary(daily_flights)

Summary Stats for April Flights

Code
summary_table <- daily_flights %>%
  summarise(
    Total_Flights = sum(NumFlights, na.rm = TRUE),
    Average_Flights_Per_Day = round(mean(NumFlights, na.rm = TRUE), 0),
    Min_Daily_Flights = min(NumFlights, na.rm = TRUE),
    Max_Daily_Flights = max(NumFlights, na.rm = TRUE)
  )

summary_table %>%
  kbl(
    col.names = c("Total Flights",
                  "Average Flights per Day",
                  "Min Daily Flights",
                  "Max Daily Flights"),
    align = c("c","c","c","c") 
  ) %>%
  kable_styling(full_width = FALSE)
Total Flights Average Flights per Day Min Daily Flights Max Daily Flights
612024 19743 1 21410

This summary table provides a high-level snapshot of system volume for April. With over 612,000 total flights and an average of roughly 19,700 flights per day.

Daily Flights vs 3-Day Rolling Average

This visualization shows the overall flight activity throughout April and compares the daily flight counts to a 3-day rolling average.

Code
plot <- ggplot(daily_flights, aes(x = FlightDate)) +
  geom_line(aes(y = NumFlights, color = "Daily Flights"), linewidth = 0.5) +
  geom_line(aes(y = AvgFlights_Preceding3Days, color = "3-Day Avg"), linewidth = 0.5) +
  scale_color_manual(values = c(
    "Daily Flights" = "#1f77b4",  
    "3-Day Avg"     = "#e15759" 
  )) +
  scale_y_continuous(
    breaks = c(0, 10000, 20000, 30000),
    expand = expansion(mult = c(0.10, 0.35))
  ) +
  labs(
    x = "Date",
    y = "Flights",
    color = "",
  ) +
  theme_minimal()

ggplotly(plot, tooltip = c("x", "y"))
Summary

The number of flights through out april stay fairly stable around 20,000 flights per day. Daily flight counts fluctuate in a clear weekly pattern: weekday volumes are consistently higher, while weekends show noticeable dips, which explains the repeating troughs in the blue line.

Flight Volume by Day of Week

Code
day_patterns <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem03.csv",
                         show_col_types = FALSE)

day_patterns$DayOfWeek <- factor(day_patterns$DayOfWeek,
                                 levels = c("Monday", "Tuesday", "Wednesday",
                                           "Thursday", "Friday", "Saturday", "Sunday"))

This chart shows how flights were distributed across the week in April, revealing which days were the busiest and which saw lighter activity.

Code
ggplot(day_patterns, aes(x = DayOfWeek, y = NumFlights, fill = DayOfWeek)) +
  geom_col(show.legend = FALSE, width = 0.65) +
  geom_text(aes(label = comma(NumFlights)), 
            vjust = -0.4, size = 4.2, fontface = "bold") +
  scale_fill_manual(values = c(
    "Monday"    = "#4C72B0",
    "Tuesday"   = "#55A868",
    "Wednesday" = "#C44E52",
    "Thursday"  = "#8172B3",
    "Friday"    = "#CCB974",
    "Saturday"  = "#64B5CD",
    "Sunday"    = "#8C8C8C"
  )) +

  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  
  labs(
    x = "Day of Week",
    y = "Number of Flights"
  ) +

  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Key Findings
  • Monday is the busiest day with 106,575 flights, reflecting business travel patterns
  • Saturday sees the lowest volume at under 90,000 flights
  • Weekly pattern is pronounced: approximately 20% variation between peak and trough
  • The rolling average reveals consistent weekday strength with weekend pullbacks

Section 2: Airline Performance

This section examines the delays and early departures across different airlines.

Code
max_delays <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem01.csv",
                       show_col_types = FALSE)

Maximum Departure Delays

The worst delays recorded for each airline.

Code
ggplot(max_delays, aes(x = reorder(Name, MaxDepDelay),
                       y = MaxDepDelay)) +
  geom_col(fill = "#C44E52", width = 0.7) + 
  geom_text(aes(label = paste0(round(MaxDepDelay/60, 1), " hrs")),
            hjust = -0.15, size = 3.6, fontface = "plain") +
  
  coord_flip() +
  
  scale_y_continuous(
    labels = comma,
    expand = expansion(mult = c(0, 0.12))
  ) +
  
  labs(
    x = "",
    y = "Maximum Delay (minutes)"
  ) +
  
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_text(face = "bold", size = 11),
    axis.text.x = element_text(face = "bold", size = 11)
  )

Top 5 Airlines by Maximum Delay

Code
max_delays %>%
  arrange(desc(MaxDepDelay)) %>%
  mutate(
    Hours = round(MaxDepDelay / 60, 1),
    `Delay Time` = paste0(MaxDepDelay, " min (", Hours, " hrs)")
  ) %>%
  select(Name, `Delay Time`) %>%
  head(5) %>%
  kable(
    col.names = c("Airline", "Delay Time"),
    align = c("c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Airline Delay Time
Envoy Air: MQ 2079 min (34.6 hrs)
American Airlines Inc.: AA 1663 min (27.7 hrs)
SkyWest Airlines Inc.: OO 1424 min (23.7 hrs)
ExpressJet Airlines LLC: EV 1412 min (23.5 hrs)
Republic Airline: YX 1355 min (22.6 hrs)

Maximum Early Departures

While delays frustrate passengers, early departures can be equally problematic as passengers arriving at the scheduled time could miss their flights entirely.

Code
early_deps <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem02.csv",
                       show_col_types = FALSE)

ggplot(early_deps, aes(x = reorder(Name, EarlyDep),
                       y = EarlyDep)) +
  geom_col(fill = "#2ca02c", width = 0.7) +  
  geom_text(aes(label = paste0(round(EarlyDep), " min")),
            hjust = -0.15, size = 3.6, fontface = "plain") +
  
  coord_flip() +
  
  scale_y_continuous(
    labels = comma,
    expand = expansion(mult = c(0, 0.12))
  ) +
  
  labs(
    x = "",
    y = "Early Departure (minutes)"
  ) +
  
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_text(face = "bold", size = 11),
    axis.text.x = element_text(face = "bold", size = 11)
  )

Key Findings
  • Maximum delay recorded: 2,620 minutes (43.7 hours)—over 1.8 days!
  • Multiple airlines had delays exceeding 24 hours, suggesting mechanical issues or severe weather disruptions
  • Early departures ranged up to 60 minutes

Section 3: Airport Performance

This section identifies which airports struggle most with on-time performance.

Worst Performing Airport Overall

Code
worst_airport <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem04.csv",
                          show_col_types = FALSE)

worst_airport %>%
  mutate(AverageDepDelay = round(AverageDepDelay, 2)) %>%
  kable(
    col.names = c("Airport Name & Location", "Code", "Average Delay (minutes)"),
    align = c("l", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  row_spec(1, bold = TRUE, color = "red")
Airport Name & Location Code Average Delay (minutes)
North Bend/Coos Bay, OR: Southwest Oregon Regional OTH 46.67

Every flight departing from this airport faces, on average, a 47-minute delay. This isn’t an outlier problem; it’s systemic.

Worst Airport for Each Airline

This analysis reveals each carrier’s most problematic airport.

Code
airline_worst <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem05.csv",
                          show_col_types = FALSE)

airline_worst %>%
  arrange(desc(MaxAvgDepDelay)) %>%
  head(5) %>%
  mutate(MaxAvgDepDelay = round(MaxAvgDepDelay, 2)) %>%
  kable(col.names = c("Airline", "Airport Name", "Avg Delay (min)"), align = c("c", "c", "c"),
        caption = "<b>Top 5 Worst Airline–Airport Combinations</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 5 Worst Airline–Airport Combinations
Airline Airport Name Avg Delay (min)
United Air Lines Inc.: UA Oklahoma City, OK: Will Rogers World 200.00
ExpressJet Airlines LLC: EV Dallas/Fort Worth, TX: Dallas/Fort Worth International 127.00
Endeavor Air Inc.: 9E Bristol/Johnson City/Kingsport, TN: Tri Cities 122.00
Mesa Airlines Inc.: YV Fort Myers, FL: Southwest Florida International 94.07
Republic Airline: YX Houston, TX: William P Hobby 89.00
Code
airline_worst_top <- airline_worst %>%
  arrange(desc(MaxAvgDepDelay)) %>%
  head(15)

airline_worst_top <- airline_worst_top %>%
  mutate(AirportCode = sub(".*: ", "", AirportName))
Code
airline_worst_top %>%
  mutate(Label = paste(AirlineName, "->", AirportCode)) %>%
  ggplot(aes(x = "", y = reorder(Label, MaxAvgDepDelay), fill = MaxAvgDepDelay)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(MaxAvgDepDelay, 1)), color = "black", size = 4) +
  scale_fill_gradient(low = "#FEE5D9", high = "#CB181D") +
  labs(
    x = "",
    y = "",
    fill = "Avg Delay (min)",
    title = "Worst Airport for Each Airline"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_text(face = "bold")  
  )


Section 4: Cancellations

This section examines the scale and causes of cancellations.

Total Cancellations

Code
cancellations <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem06a.csv",
                          show_col_types = FALSE)

total_cancelled <- cancellations$NumCancelledFlights[1]

total_flights_approx <- sum(day_patterns$NumFlights)
cancel_rate <- (total_cancelled / total_flights_approx) * 100
The Scale of Disruption

14,488 flights were cancelled in April 2019

  • That’s approximately 483 cancellations per day
  • Represents about 2.37% of all scheduled flights
Code
tibble(
  Category = c("Completed Flights", "Cancelled Flights"),
  Count = c(total_flights_approx - total_cancelled, total_cancelled)
) %>%
  ggplot(aes(x = Category, y = Count, fill = Category)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = comma(Count)),
            vjust = -0.4, size = 4.5, fontface = "bold") +
  scale_fill_manual(values = c(
    "Completed Flights" = "#4C72B0",  
    "Cancelled Flights" = "#C44E52"  
  )) +
  scale_y_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.10))) +
  labs(
    x = "",
    y = "Number of Flights"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor   = element_blank(),
    axis.text.x        = element_text(face = "bold")
  )

Primary Drivers of Flight Cancellations

Code
cancel_reasons <- read_csv("SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problem06b.csv",
                           show_col_types = FALSE)

reason_summary <- cancel_reasons %>%
  group_by(CancellationCode, Reason) %>%
  summarise(Airports = n(), .groups = "drop") %>%
  arrange(desc(Airports))
Code
reason_summary %>%
  ggplot(aes(x = reorder(Reason, Airports),
             y = Airports,
             fill = Reason)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = Airports),
            hjust = -0.2,                    
            size = 4.5,
            fontface = "bold") +
  coord_flip() +
  scale_fill_manual(values = c(
    "Weather"              = "#4C72B0",
    "Carrier"              = "#C44E52",
    "National Air System"  = "#55A868",
    "Security"             = "#8172B3"
  )) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Primary Cancellation Reasons by Airport Count",
    x = "",
    y = "Number of Airports",
    fill = "Reason"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.y = element_text(face = "bold"),
    panel.grid.major.y = element_blank(),
    plot.title = element_text(face = "bold", size = 15)
  )

Key Findings
  • Weather dominates: 250 airports cite weather as the primary cancellation cause
  • Carrier issues are the leading cause at 56 airports, representing controllable operational problems
  • National Air System (NAS) factors affect 11 airports
  • April’s spring weather patterns explain the weather-heavy cancellation profile
  • The geographic variation suggests localized weather vulnerability in certain regions
Code
cancel_reasons %>%
  select(AirportName, CancellationCode, Reason, NumCancellations) %>%
  head(10) %>%
  kable(col.names = c("Airport Location", "Code", "Primary Reason", "# Cancellations"),
        caption = "Cancellation Reason by Airport (Top 10)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Cancellation Reason by Airport (Top 10)
Airport Location Code Primary Reason # Cancellations
Dallas/Fort Worth, TX: Dallas/Fort Worth International B Weather 888
Chicago, IL: Chicago O'Hare International B Weather 844
Denver, CO: Denver International B Weather 467
Charlotte, NC: Charlotte Douglas International B Weather 382
Minneapolis, MN: Minneapolis-St Paul International B Weather 294
Dallas, TX: Dallas Love Field B Weather 231
Chicago, IL: Chicago Midway International B Weather 218
Houston, TX: William P Hobby A Carrier 208
Houston, TX: George Bush Intercontinental/Houston B Weather 195
New York, NY: LaGuardia B Weather 187

Conclusion

1. Business Travel Dominates the Industry

The pronounced Monday peak (106,575 flights) and Saturday trough demonstrate that business travel, not leisure,. This weekday-centric pattern shows airlines optimize for corporate travelers returning from weekends and beginning weekly business trips.

2. Extreme Delays Reveal Operational vulnerabilities

While most flights operate smoothly, the presence of 43+ hour delays shows the huge gap between routine operations and worst-case scenarios highlighting the industry’s vulnerability.

3. Small Airports Face Disproportionate Challenges

Southwest Oregon Regional Airport’s 47-minute average delay . Smaller regional airports lack the infrastructure, redundancy, and resources to absorb disruptions, leading to persistent performance problems that affect every departure.

4. Weather Remains the Uncontrollable Wild Card

With weather being the dominant cancellation cause across the majority of airports, airlines face an operational reality that no amount of planning can fully overcome. However, carrier-related cancellations represent controllable failures where operational improvements could make a difference.

Recommendations

Operational Improvements

  1. Target Problem Airports: Direct resources to airports with chronic delay issues (OTH, etc.)
  2. Weather Preparedness: Enhance forecasting and contingency planning for April weather patterns
  3. Schedule Buffer: Build in realistic turnaround times to prevent cascading delays
  4. Early Departure Controls: Implement policies to prevent flights from leaving significantly ahead of schedule

Strategic Priorities

  1. Business Traveler Focus: Given Monday’s dominance, optimize Monday operations and customer service
  2. Regional Airport Support: Invest in infrastructure or operational support for struggling smaller airports
  3. Cancellation Transparency: Provide clearer communication about cancellation causes and rebooking options
  4. Data-Driven Resource Allocation: Use airport-specific and airline-specific delay patterns to guide staffing and equipment deployment

Data Source & Methodology

Data Source: U.S. Department of Transportation, Bureau of Transportation Statistics

Dataset: Airline On-Time Performance and Causes of Delay

Time Period: April 2019

Records Analyzed: 600,000+ individual flight records

Analysis Tools: SQL for data querying, R with ggplot2 for visualization

Source: https://www.transtats.bts.gov/


SQL Queries

This section documents the SQL queries used to extract and analyze the flight data from the database. Each query corresponds to a specific analysis question and feeds into the visualizations presented above.

Query 1: Maximum Departure Delays by Airline

This query finds the worst departure delay for each airline and sorts results from smallest to largest.

SELECT l_ai_d.Name,
       MAX(a.DepDelayMinutes) AS MaxDepDelay
FROM al_perf AS a
JOIN L_AIRLINE_ID AS l_ai_d
  ON a.DOT_ID_Reporting_Airline = l_ai_d.ID
GROUP BY l_ai_d.Name
ORDER BY MaxDepDelay ASC;

Results: 17 airlines | Used in: Section 2 - Maximum Departure Delays


Query 2: Maximum Early Departures by Airline

This query identifies the earliest departures (in minutes) for each airline, sorted from largest to smallest early departure.

SELECT l_ai_d.Name,
       -MIN(a.DepDelay) AS EarlyDep
FROM al_perf AS a
JOIN L_AIRLINE_ID AS l_ai_d
  ON a.DOT_ID_Reporting_Airline = l_ai_d.ID
GROUP BY l_ai_d.Name
ORDER BY EarlyDep DESC;

Results: 17 airlines | Used in: Section 2 - Maximum Early Departures


Query 3: Flight Volume by Day of Week

This query ranks days of the week by total flight volume, with 1 being the busiest day.

SELECT l_w.Day AS DayOfWeek,
       COUNT(*) AS NumFlights,
       RANK() OVER (ORDER BY COUNT(*) DESC) AS FlightRank
FROM al_perf AS a
JOIN L_WEEKDAYS AS l_w
  ON a.DayOfWeek = l_w.Code
GROUP BY l_w.Day
ORDER BY FlightRank ASC;

Results: 7 days | Used in: Section 1 - Flight Volume by Day of Week


Query 4: Airport with Highest Average Departure Delay

This query identifies the single worst-performing airport based on average departure delay (treating early departures as 0 delay).

SELECT l_a.Name,
       l_a.Code AS Code,
       AVG(a.DepDelayMinutes) AS AverageDepDelay
FROM al_perf AS a
JOIN L_AIRPORT AS l_a
  ON a.Origin = l_a.Code
GROUP BY l_a.Name, l_a.Code
ORDER BY AverageDepDelay DESC
LIMIT 1;

Results: 1 airport | Used in: Section 3 - Worst Performing Airport Overall


Query 5: Worst Airport for Each Airline

This query uses CTEs to find each airline’s most problematic departure airport based on average delay.

WITH avg_delays AS (
  SELECT a.DOT_ID_Reporting_Airline AS airline_id,
         a.OriginAirportID AS airport_id,
         AVG(a.DepDelayMinutes) AS avg_delay
  FROM al_perf AS a
  GROUP BY airline_id, airport_id
),
max_delays AS (
  SELECT airline_id,
         MAX(avg_delay) AS max_avg_delay
  FROM avg_delays
  GROUP BY airline_id
)
SELECT lairline.Name AS AirlineName,
       lairport.Name AS AirportName,
       md.max_avg_delay AS MaxAvgDepDelay
FROM max_delays AS md
JOIN avg_delays AS ad
  ON md.airline_id = ad.airline_id
 AND md.max_avg_delay = ad.avg_delay
JOIN L_AIRLINE_ID AS lairline
  ON ad.airline_id = lairline.ID
JOIN L_AIRPORT_ID AS lairport
  ON ad.airport_id = lairport.ID
ORDER BY MaxAvgDepDelay DESC;

Results: 17 airline-airport combinations | Used in: Section 3 - Worst Airport for Each Airline


Query 6a: Total Cancelled Flights

This query counts the total number of cancelled flights in the dataset.

SELECT COUNT(*) AS NumCancelledFlights
FROM al_perf
WHERE Cancelled = 1;

Results: 14,488 cancelled flights | Used in: Section 4 - Total Cancellations


Query 6b: Most Frequent Cancellation Reason by Airport

This query identifies the primary cancellation reason at each departure airport.

WITH reason_counts AS (
  SELECT OriginAirportID AS airport_id,
         CancellationCode AS reason_code,
         COUNT(*) AS num_cancellations
  FROM al_perf
  WHERE Cancelled = 1
  GROUP BY airport_id, reason_code
),
max_counts AS (
  SELECT airport_id,
         MAX(num_cancellations) AS max_cancel
  FROM reason_counts
  GROUP BY airport_id
)
SELECT lairport.Name AS AirportName,
       rc.reason_code AS CancellationCode,
       lcancel.Reason,
       rc.num_cancellations AS NumCancellations
FROM reason_counts rc
JOIN max_counts mc
  ON rc.airport_id = mc.airport_id
 AND rc.num_cancellations = mc.max_cancel
JOIN L_AIRPORT_ID lairport
  ON rc.airport_id = lairport.ID
JOIN L_CANCELATION lcancel
  ON rc.reason_code = lcancel.code
ORDER BY NumCancellations DESC;

Results: 317 airports | Used in: Section 4 - Primary Drivers of Flight Cancellations


Query 7: Daily Flights with 3-Day Rolling Average

This query calculates a 3-day rolling average of flight volumes using a window function.

WITH daily_flights AS (
  SELECT FlightDate,
         COUNT(*) AS NumFlights
  FROM al_perf
  GROUP BY FlightDate
)
SELECT FlightDate,
       NumFlights,
       AVG(NumFlights) OVER (
         ORDER BY FlightDate
         ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
       ) AS AvgFlights_Preceding3Days
FROM daily_flights
ORDER BY FlightDate;

Results: 31 days | Used in: Section 1 - Daily Flights vs 3-Day Rolling Average


Key SQL Techniques Used
  • Window Functions: RANK() OVER and AVG() OVER for rankings and rolling averages
  • Common Table Expressions (CTEs): Multi-step queries for complex aggregations (Queries 5 & 6b)
  • Multiple JOINs: Connecting fact tables with lookup tables for readable results
  • Aggregate Functions: MAX(), MIN(), AVG(), COUNT() for summarization
  • Conditional Filtering: WHERE Cancelled = 1 to isolate specific subsets

Full SQL Script: All queries are available in SQL_exports/Nkemdibe_Okweye_no252_Mini_Project_Problems.sql